Replies: 2 comments
Hello @zdzh and thanks for your contribution! Let me summarize this feature request into two main points:
We already planned this feature, and our platform supports it. Our data model allows adding evaluators, updating test sets (and rerunning only the changed parts), and rerunning single cases.
Yes. The evaluation run data model supports running evaluators on test cases directly, without needing to run the LLM app itself. It also supports adding new evaluators. However, the evaluation orchestrator and the UI don't provide actions for this yet.
Our evaluation run model handles this. It allows updating test sets and rerunning an evaluation without rerunning everything; it only reruns the parts that changed. Again, the data model supports this, but the orchestrator doesn't have this capability yet.
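For illustration only, here is a minimal sketch of the kind of change detection this implies; the names are hypothetical and this is not the actual Agenta orchestrator logic:

```python
def cases_to_rerun(old_cases: dict[str, dict], new_cases: dict[str, dict]) -> list[str]:
    """Return IDs of test cases that are new or whose content changed.

    Only these cases need a fresh LLM app invocation; unchanged cases keep
    their stored outputs and existing scores.
    """
    return [
        case_id
        for case_id, case in new_cases.items()
        if case_id not in old_cases or old_cases[case_id] != case
    ]
```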
Can you describe this in more detail? What would the flow look like? Right now we have an evaluator playground that lets you load a test case and experiment with it. Do you have something different in mind? For instance, running an evaluation on the evaluators themselves?
Thanks for the clarification — that makes sense 👍
🚀 Motivation & Purpose
Currently, in Agenta, the execution of Evaluators is tightly coupled with the execution of the LLM application (the "Run"). If a user wants to apply a different evaluation metric (e.g., switching from "Exact Match" to "Jaccard Similarity") to analyze existing outputs, the system requires a full re-run of the test set.
Objective: To enable users to run or re-run Evaluators on existing generated outputs without re-triggering LLM calls, thereby saving time and API tokens.
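As a hedged illustration of the objective (hypothetical names, not Agenta's API): a metric such as Jaccard similarity only needs the stored output and the expected answer, so it can be recomputed without any LLM call.

```python
def jaccard_similarity(output: str, expected: str) -> float:
    """Token-level Jaccard similarity between a stored output and the expected answer."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Re-scoring an existing output requires no LLM call:
stored_output = "The capital of France is Paris"
expected_answer = "Paris is the capital of France"
print(jaccard_similarity(stored_output, expected_answer))  # 1.0 (identical token sets)
```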
💡 Core Feature Requirements
Decouple evaluator execution from LLM app execution, so that evaluators can be run, or re-run, directly against stored output data (see the sketch after the proposed workflow below).
🛠 Proposed Workflow / Logic
Current Data Flow:
Input -> [LLM App Call] -> Output -> [Evaluator] -> Score
Proposed "Offline Eval" Mode:
Stored Output (from DB) -> [New/Updated Evaluator] -> Updated Score
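A minimal sketch of what this mode could look like, with hypothetical storage and evaluator interfaces assumed purely for illustration:

```python
from typing import Callable

# Hypothetical shape of a stored invocation: inputs, the generated output, and the reference.
stored_results = [
    {"inputs": {"question": "2+2?"}, "output": "4", "expected": "4"},
    {"inputs": {"question": "Capital of France?"}, "output": "Paris.", "expected": "Paris"},
]

def offline_eval(results: list[dict], evaluator: Callable[[str, str], float]) -> list[float]:
    """Apply a new or updated evaluator to stored outputs; no LLM app call is made."""
    return [evaluator(row["output"], row["expected"]) for row in results]

# Example: swap a strict exact-match evaluator for a more lenient one
# without regenerating any outputs.
lenient_match = lambda out, exp: float(out.strip(" .").lower() == exp.lower())
print(offline_eval(stored_results, lenient_match))  # [1.0, 1.0]
```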
💎 Expected Value
🙋 Discussion Points for the Team
What should happen if the target or inputs might have changed since the original run?
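One hedged way to frame this (hypothetical fields, for illustration only): store a fingerprint of the inputs and the app revision alongside each output, and treat a stored result as stale for offline evaluation when the current fingerprint no longer matches.

```python
import hashlib
import json

def fingerprint(inputs: dict, app_revision: str) -> str:
    """Stable hash of the inputs plus the app revision that produced the output."""
    payload = json.dumps({"inputs": inputs, "revision": app_revision}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def is_stale(stored_result: dict, current_inputs: dict, current_revision: str) -> bool:
    """A stored output is stale if the inputs or the app revision have changed since the run."""
    return stored_result["fingerprint"] != fingerprint(current_inputs, current_revision)
```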