Replies: 2 comments
Hello @zdzh and thanks for your contribution! Let me summarize this feature request into two main points:
We already planned this feature, and our platform supports it. Our data model allows adding evaluators, updating test sets (and rerunning only the changed parts), and rerunning single cases.
Yes. The evaluation run data model supports running evaluators on test cases directly, without needing to run the LLM app itself. It also supports adding new evaluators. However, the evaluation orchestrator and the UI don't provide actions for this yet.
Our evaluation run model handles this. It allows updating test sets and rerunning an evaluation without rerunning everything; it only reruns the parts that changed. Again, the data model supports this, but the orchestrator doesn't have this capability yet.
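For illustration only, here is a minimal sketch of the kind of change detection this implies; the names are hypothetical and this is not the actual Agenta orchestrator logic:

```python
def cases_to_rerun(old_cases: dict[str, dict], new_cases: dict[str, dict]) -> list[str]:
    """Return IDs of test cases that are new or whose content changed.

    Only these cases need a fresh LLM app invocation; unchanged cases keep
    their stored outputs and existing scores.
    """
    return [
        case_id
        for case_id, case in new_cases.items()
        if case_id not in old_cases or old_cases[case_id] != case
    ]
```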
Can you describe this in more detail? What would the flow look like? Right now we have an evaluator playground that lets you load a test case and experiment with it. Do you have something different in mind? For instance, running an evaluation on the evaluators themselves?
Thanks for the clarification — that makes sense 👍
🚀 Motivation & Purpose
Currently, in Agenta, the execution of Evaluators is tightly coupled with the execution of the LLM application (the "Run"). If a user wants to apply a different evaluation metric (e.g., switching from "Exact Match" to "Jaccard Similarity") to analyze existing outputs, the system requires a full re-run of the test set.
Objective: To enable users to run or re-run Evaluators on existing generated outputs without re-triggering LLM calls, thereby saving time and API tokens.
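As a hedged illustration of the objective (hypothetical names, not Agenta's API): a metric such as Jaccard similarity only needs the stored output and the expected answer, so it can be recomputed without any LLM call.

```python
def jaccard_similarity(output: str, expected: str) -> float:
    """Token-level Jaccard similarity between a stored output and the expected answer."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Re-scoring an existing output requires no LLM call:
stored_output = "The capital of France is Paris"
expected_answer = "Paris is the capital of France"
print(jaccard_similarity(stored_output, expected_answer))  # 1.0 (identical token sets)
```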
💡 Core Feature Requirements
Decouple evaluator execution from LLM app execution, so that evaluators can be run, or re-run, directly against stored output data (see the sketch after the proposed workflow below).
🛠 Proposed Workflow / Logic
Current Data Flow:
Input -> [LLM App Call] -> Output -> [Evaluator] -> Score
Proposed "Offline Eval" Mode:
Stored Output (from DB) -> [New/Updated Evaluator] -> Updated Score
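A minimal sketch of what this mode could look like, with hypothetical storage and evaluator interfaces assumed purely for illustration:

```python
from typing import Callable

# Hypothetical shape of a stored invocation: inputs, the generated output, and the reference.
stored_results = [
    {"inputs": {"question": "2+2?"}, "output": "4", "expected": "4"},
    {"inputs": {"question": "Capital of France?"}, "output": "Paris.", "expected": "Paris"},
]

def offline_eval(results: list[dict], evaluator: Callable[[str, str], float]) -> list[float]:
    """Apply a new or updated evaluator to stored outputs; no LLM app call is made."""
    return [evaluator(row["output"], row["expected"]) for row in results]

# Example: swap a strict exact-match evaluator for a more lenient one
# without regenerating any outputs.
lenient_match = lambda out, exp: float(out.strip(" .").lower() == exp.lower())
print(offline_eval(stored_results, lenient_match))  # [1.0, 1.0]
```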
💎 Expected Value
🙋 Discussion Points for the Team
What should happen if the target or inputs might have changed since the original run?
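One hedged way to frame this (hypothetical fields, for illustration only): store a fingerprint of the inputs and the app revision alongside each output, and treat a stored result as stale for offline evaluation when the current fingerprint no longer matches.

```python
import hashlib
import json

def fingerprint(inputs: dict, app_revision: str) -> str:
    """Stable hash of the inputs plus the app revision that produced the output."""
    payload = json.dumps({"inputs": inputs, "revision": app_revision}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def is_stale(stored_result: dict, current_inputs: dict, current_revision: str) -> bool:
    """A stored output is stale if the inputs or the app revision have changed since the run."""
    return stored_result["fingerprint"] != fingerprint(current_inputs, current_revision)
```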