
Conversation

@Marius-Graml

Description

Added an image artifactsaver and corresponding tests. Also added a small method for the optimization agent in evaluation_agent.py that creates a JSON file mapping the model's input prompts to the generated output images. This makes it possible to log each generated image with its corresponding prompt as the file name.
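As an illustration of the mapping format (a minimal sketch; the function name, output file name, and call signature are hypothetical, not the actual code in evaluation_agent.py):

```python
import json
from pathlib import Path


def save_prompt_image_mapping(prompts: list[str], image_paths: list[Path], output_dir: Path) -> Path:
    """Write a JSON file that maps each input prompt to its generated image file (sketch only)."""
    mapping = {prompt: str(path) for prompt, path in zip(prompts, image_paths)}
    mapping_file = output_dir / "prompt_to_image.json"
    mapping_file.write_text(json.dumps(mapping, indent=2))
    return mapping_file
```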

Related Issue

/

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Image artifact saver: unit tests implemented.
JSON file creation for the optimization agent: verified the created JSON file manually.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

This PR combines the branches "algo-sweeper" and "image-artifactsaver". Originally, I started this branch from Begüm's branch, where the video artifactsaver was implemented. It seems that branch had a few bugs, which are now also present in this branch.

begumcig and others added 30 commits November 27, 2025 13:18
* feat: remove algorithm groups from algorithms folder

* feat: simply new algorithm registration to smash space

* refactor: add new smash config interface

* refactor: remove unused tokenizer name function

* refactor: adjust order implementation

* feat: add new graph-based path finding for algorithm execution order

* tests: add first version of pre-smash-routines tests

* tests: narrow down pre-smash routine tests

* refactor: rename PRUNA_ALGORITHMS

* refactor: enhance algorithm tags

* refactor: remove `incompatible` specification

* feat: add `smash_config` utility

* style: initial fix all linting complaints

* tests: adjust test structure to new refactoring

* style: address PR comments

* fix: conditionally register algorithms

* fix: adjust smash config access in algorithms

* fix: support older smash configs

* fix: handle target module exception

* fix: deprecated save/load imports

* tests: update to fit recent interface changes

* fix: add `global_utils` exception to algorithm registry

* fix: extending compatible methods

* fix: deprecate old hyperparameter interface properly

* tests: add symmetry checks for algorithm order

* style: address PR comments

* feat: add utility to register custom algorithm

* fix: insufficient docstring descriptions

* fix: test references to HQQ

* style: fix remaining linting errors

* style: fix typing error w.r.t. compatibility setter

* style: import sorting

* fix: return type of registry function

* fix: model context docstring

* fix: some final bugs

* fix: duplicate pyproject.toml key

* fix: address cursorbot slander

* style: move inline comments

* fix: unify registry logic

* feat: additional check in algorithm order overwrite

* fix: documentation wording

* fix: device function patching in tests

cursor bot left a comment


Comment @cursor review or bugbot run to trigger another review on this PR

import json
import tempfile
from pathlib import Path
from traceback import print_tb

Unused debug import left in code

Low Severity

The import `from traceback import print_tb` is added but never used anywhere in the file. This appears to be leftover debugging code that was accidentally included in the commit.


# Usually, the data is already a PIL.Image, so we don't need to convert it.
if isinstance(data, torch.Tensor):
    data = np.transpose(data.cpu().numpy(), (1, 2, 0))
    data = np.clip(data * 255, 0, 255).astype(np.uint8)

uint8 tensors corrupted by unconditional scaling

High Severity

When saving tensor data, the code unconditionally multiplies values by 255, assuming the input is in the [0, 1] float range. If a `uint8` tensor (values 0-255) is passed, this corrupts the data: for example, a pixel value of 2 becomes 2 * 255 = 510, which clips to 255. Effectively, all non-zero values become 255, turning the image into a near-binary output. The tests only verify file existence, not content correctness.
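For reference, a minimal sketch of a dtype-aware conversion that would avoid re-scaling uint8 inputs (assuming float tensors arrive in [0, 1] and in CHW layout, as in the snippet above; the function name is illustrative):

```python
import numpy as np
import torch


def to_uint8_image(data: torch.Tensor | np.ndarray) -> np.ndarray:
    """Convert image data to an HWC uint8 array without corrupting inputs that are already uint8."""
    if isinstance(data, torch.Tensor):
        data = np.transpose(data.cpu().numpy(), (1, 2, 0))  # CHW -> HWC
    if data.dtype == np.uint8:
        return data  # already 0-255; scaling again would clip nearly everything to 255
    # Assume float data in [0, 1] and scale it up to the 0-255 range.
    return np.clip(data * 255, 0, 255).astype(np.uint8)
```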


# https://github.com/Vchitect/VBench/blob/dc62783c0fb4fd333249c0b669027fe102696682/evaluate.py#L111
# explicitly sets the device to cuda. We respect this here.
runs_on: List[str] = ["cuda"]
modality: List[str] = ["video"]

VBench metrics use wrong type for modality

Medium Severity

The VBench metrics declare `modality: List[str] = ["video"]`, but the base class `StatefulMetric` declares `modality: set[str]`. All other metrics in the codebase (`CMMD`, `PairwiseClipScore`, `SharpnessMetric`) use sets like `modality = {IMAGE}`. While `set.intersection()` in `validate_and_get_task_modality` happens to work with lists, this type inconsistency violates the interface contract, and the constant `VIDEO` from utils.py is not being used.
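A minimal sketch of a declaration that matches the base class annotation (illustrative only; `VIDEO` here stands in for the unused constant from utils.py, and the class is a stub rather than the real metric):

```python
VIDEO = "video"  # assumption: mirrors the constant defined in utils.py


class VBenchMetricSketch:
    """Stub showing only the class attributes; not the real VBench metric implementation."""

    runs_on: list[str] = ["cuda"]
    modality: set[str] = {VIDEO}  # a set, matching the base class annotation modality: set[str]
```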


# Test 3D tensor (should fail)
invalid_tensor = torch.randn(2, 3, 16)
with pytest.raises(ValueError, match="4 or 5 dimensional"):
    metric.validate_batch(invalid_tensor)

Tests call non-existent validate_batch method

Medium Severity

The test `test_vbench_metrics_invalid_tensor_dimensions` calls `metric.validate_batch(invalid_tensor)`, but neither `VBenchBackgroundConsistency` nor `VBenchDynamicDegree` implements a `validate_batch` method. The test expects a `ValueError` with the message "4 or 5 dimensional" but will instead raise an `AttributeError`. The test comments also reference `validate_batch` for 4D tensor conversion, indicating this method was expected but never implemented.
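One way the missing validation could look, matching the error message the test expects (a sketch under that assumption, not the project's actual method):

```python
import torch


def validate_batch(batch: torch.Tensor) -> torch.Tensor:
    """Sketch: require a 4D (T, C, H, W) or 5D (B, T, C, H, W) tensor and add a batch dim if needed."""
    if batch.dim() not in (4, 5):
        raise ValueError(f"Expected a 4 or 5 dimensional tensor, got {batch.dim()} dimensions.")
    if batch.dim() == 4:
        batch = batch.unsqueeze(0)  # promote (T, C, H, W) to (1, T, C, H, W)
    return batch
```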


# So we need to convert the arguments to an EasyDict.
args_new = EasyDict({"model": model_path, "small": False, "mixed_precision": False, "alternate_corr": False})
self.DynamicDegree = PrunaDynamicDegree(args_new, device)
self.add_state("scores", [])

interval parameter silently ignored in VBenchDynamicDegree

Medium Severity

`VBenchDynamicDegree.__init__` accepts an `interval` parameter via `**kwargs` but never uses it. The parameter is absorbed into `kwargs` and passed to the parent `__init__`, but it is not used when creating `PrunaDynamicDegree`. Tests explicitly pass different `interval` values (1, 3, 4, 5, 10) expecting different sampling behavior, but all values produce identical results because the parameter has no effect.
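A hedged sketch of one way `interval` could take effect, by subsampling frames along the time axis before they are scored (the integration point with `PrunaDynamicDegree` is an assumption):

```python
import torch


def subsample_frames(frames: torch.Tensor, interval: int) -> torch.Tensor:
    """Sketch: keep every interval-th frame of a (T, C, H, W) video tensor."""
    if interval <= 1:
        return frames
    return frames[::interval]
```

The subsampled frames would then be what is handed to the dynamic-degree computation, so different `interval` values actually change the result.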


MetricResult
    The final score.
"""
score = self.similarity_scores / self.n_samples

Division by zero and empty mean return nan

Medium Severity

Both VBench metrics lack validation for empty state before computing results. `VBenchBackgroundConsistency.compute()` performs `self.similarity_scores / self.n_samples`, where `n_samples` is initialized to 0, causing division by zero and returning `nan`. Similarly, `VBenchDynamicDegree.compute()` calls `np.mean(self.scores)`, where `scores` is initialized to an empty list, also returning `nan`. The test `test_vbench_metrics_compute_without_updates` expects 0.0 in both cases, but will fail due to these missing zero-checks.
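A minimal sketch of the guard the test implies, assuming `compute()` should fall back to 0.0 when no samples have been accumulated (helper names are illustrative):

```python
import numpy as np


def safe_mean(scores: list[float]) -> float:
    """Sketch: return 0.0 instead of nan when no scores were accumulated."""
    return float(np.mean(scores)) if scores else 0.0


def safe_ratio(total: float, n_samples: int) -> float:
    """Sketch: return 0.0 instead of dividing by zero when n_samples is 0."""
    return total / n_samples if n_samples > 0 else 0.0
```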


Marius-Graml requested review from begumcig and johannaSommer and removed the review request for begumcig and johannaSommer on January 13, 2026 at 09:49.