Conversation
- The other scores should be tweaked as well.
- FactualScore link is https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/factual_correctness/#factual-correctness
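For reference, a minimal sketch of how the linked FactualCorrectness metric can be tuned in ragas. The `mode` and `atomicity` knobs come from the linked docs; the evaluator model name here is only a placeholder:

```python
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import FactualCorrectness

# Placeholder evaluator model; the evaluation code wires in its own self.model.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# mode accepts "f1" (default), "precision", or "recall";
# atomicity ("low"/"high") controls how finely claims are decomposed before checking.
factual_correctness = FactualCorrectness(llm=evaluator_llm, mode="f1", atomicity="low")
```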
Walkthrough
Implements a multi-phase evaluation: batched execution to populate responses, compute factual_correctness across all items, filter out rows marked NO_ANSWER_REFERENCE, run the core metrics on the filtered set, then merge the per-row scores and persist the results and aggregated cost.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Orchestrator as Evaluation
    participant Executor
    participant Engine as LLM Engine
    participant EvalDS as EvaluationDataset
    participant MetricsRunner as Metrics
    participant Storage as Persist
    rect rgba(200,230,255,0.3)
        Note over Orchestrator,Executor: Phase 1 — Populate responses (batched)
        Orchestrator->>Executor: build RunConfig & batch queries
        Executor->>Engine: execute queries
        Engine-->>Executor: responses per item
        Executor-->>Orchestrator: populated EvalDS
    end
    rect rgba(235,255,200,0.3)
        Note over Orchestrator,MetricsRunner: Phase 2 — Factual correctness (all rows)
        Orchestrator->>MetricsRunner: run FactualCorrectness on EvalDS
        MetricsRunner->>Engine: evaluator LLM calls (if required)
        MetricsRunner-->>Orchestrator: factual_correctness results
    end
    rect rgba(255,230,230,0.3)
        Note over Orchestrator,EvalDS: Phase 3 — Filter & core metrics
        Orchestrator->>EvalDS: filter out NO_ANSWER_REFERENCE rows
        Orchestrator->>MetricsRunner: run core metrics on filtered EvalDS
        MetricsRunner-->>Orchestrator: core metric results
    end
    rect rgba(240,240,240,0.4)
        Note over Orchestrator,Storage: Phase 4 — Merge & persist
        Orchestrator->>Orchestrator: merge factual + core metrics into full table
        Orchestrator->>Storage: _persist_cost(results:list|single) -> aggregated cost payload
        Storage-->>Orchestrator: persisted
    end
```
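To make the flow concrete, here is a rough Python outline of the four phases as they might look inside the evaluation class. The method name `run_full_evaluation`, the `_populate_responses` helper, the sentinel value, and the exact core-metric column names are assumptions; only the merge and persist lines mirror the diff excerpts reviewed below.

```python
from ragas import EvaluationDataset

NO_ANSWER_REFERENCE = "NO_ANSWER"  # assumed sentinel value; the real constant lives in evaluation/evaluation.py


def run_full_evaluation(self, wrapped_engine, evaluation_dataset_all):
    # Phase 1: batched execution fills in sample.response for every row (assumed helper name).
    self._populate_responses(wrapped_engine, evaluation_dataset_all)

    # Phase 2: factual_correctness on the full dataset; responses are already populated.
    result_fc = self._evaluate_metrics_only(
        evaluation_dataset_all, metrics_override=["factual_correctness"]
    )

    # Phase 3: core metrics only on rows whose reference/response are not the NO_ANSWER sentinel.
    kept = [
        s
        for s in evaluation_dataset_all.samples
        if not (s.reference == NO_ANSWER_REFERENCE and s.response == NO_ANSWER_REFERENCE)
    ]
    result_core = self._evaluate_metrics_only(
        EvaluationDataset(samples=kept),
        metrics_override=[
            "faithfulness",
            "answer_relevancy",
            "context_precision",
            "context_recall",
        ],
    )

    # Phase 4: merge per-row scores on the shared keys and persist results + aggregated cost.
    df_fc = result_fc.to_pandas()
    df_core = result_core.to_pandas()
    core_cols = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
    merged = df_fc.merge(
        df_core[["user_input", "reference", "response"] + core_cols],
        on=["user_input", "reference", "response"],
        how="left",
    )
    merged.to_csv("results.csv", index=False)
    self._persist_cost([result_fc, result_core], "results_cost.json")
```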
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
- Added functionality to evaluate metrics on datasets with populated responses without re-querying the engine.
- Implemented filtering for core metrics to exclude rows where both reference and response are marked as NO_ANSWER_REFERENCE.
- Updated cost persistence to handle multiple evaluation results and improved logging for better traceability (see the sketch after this list).
- Refactored the evaluation method to allow for metric overrides, enhancing flexibility in evaluation configurations.
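A rough outline of what a list-aware `_persist_cost` could look like. The `cost_cb` attribute and the JSON payload shape are assumptions; only the token-merging logic mirrors the diff excerpt shown further down in the review.

```python
import json
import logging


def _persist_cost(self, results, path: str) -> None:
    """Aggregate token usage across one or more evaluation results and write it to JSON."""
    if not isinstance(results, list):
        results = [results]

    total_tokens_obj = None
    for result in results:
        cb = result.cost_cb  # assumption: ragas attaches a cost callback when a token_usage_parser is used
        try:
            tokens = cb.total_tokens()
            if tokens is not None:
                if total_tokens_obj is None:
                    total_tokens_obj = tokens
                else:
                    # merge this result's token counts into the running total
                    total_tokens_obj.input_tokens += tokens.input_tokens
                    total_tokens_obj.output_tokens += tokens.output_tokens
        except Exception:
            logging.exception("Failed computing total tokens from cost callback")

    payload = {
        "input_tokens": getattr(total_tokens_obj, "input_tokens", None),
        "output_tokens": getattr(total_tokens_obj, "output_tokens", None),
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```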
Actionable comments posted: 0
🧹 Nitpick comments (5)
evaluation/evaluation.py (5)
82-82: Remove extraneous `f` prefix from logging statement. The f-string has no placeholders.
Apply this diff:
- logging.info(f"Running queries to get responses for all rows...") + logging.info("Running queries to get responses for all rows...")
171-171: Consider using iterable unpacking for concatenation. While the current concatenation is functional, iterable unpacking would be more Pythonic and slightly more efficient.
Apply this diff:
```diff
    merged = df_fc.merge(
-       df_core[["user_input", "reference", "response"] + core_cols],
+       df_core[[*["user_input", "reference", "response"], *core_cols]],
        on=["user_input", "reference", "response"],
        how="left",
    )
```
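Minor note: the unpacking can be flattened one step further, which reads a bit more naturally than unpacking a list literal:

```python
# Equivalent to the suggestion above, without the nested list literal.
df_core[["user_input", "reference", "response", *core_cols]]
```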
176-176: Remove extraneous `f` prefixes from logging statements. These f-strings have no placeholders.
Apply this diff:
- logging.info(f"Persisting results to results.csv") + logging.info("Persisting results to results.csv") merged.to_csv("results.csv", index=False) - logging.info(f"Persisting cost information to results_cost.json...") + logging.info("Persisting cost information to results_cost.json...") self._persist_cost([result_fc, result_core], "results_cost.json")Also applies to: 179-179
215-216: Log exceptions instead of silently passing. Silently catching and ignoring exceptions when computing token usage makes debugging difficult and could hide issues with cost tracking.
Apply this diff:
```diff
        try:
            tokens = cb.total_tokens()
            if tokens is not None:
                if total_tokens_obj is None:
                    total_tokens_obj = tokens
                else:
                    # merge
                    total_tokens_obj.input_tokens += tokens.input_tokens
                    total_tokens_obj.output_tokens += tokens.output_tokens
-       except Exception:
-           pass
+       except Exception:
+           logging.exception("Failed computing total tokens from cost callback")
```
269-293: Consider extracting metric mapping to reduce code duplication. The `name_to_metric` dictionary construction is duplicated between `_evaluate()` and `_evaluate_metrics_only()`. Extracting this logic to a helper method would improve maintainability. Consider refactoring like this:
```python
def _get_metrics(self, metrics_override: list[str] | None = None):
    """Build and return the requested metrics."""
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=self.model))
    name_to_metric = {
        "faithfulness": Faithfulness(llm=evaluator_llm),
        "answer_relevancy": AnswerRelevancy(llm=evaluator_llm),
        "context_precision": ContextPrecision(llm=evaluator_llm),
        "context_recall": ContextRecall(llm=evaluator_llm),
        "factual_correctness": FactualCorrectness(llm=evaluator_llm),
    }
    if metrics_override is None:
        return list(name_to_metric.values())
    else:
        return [name_to_metric[m] for m in metrics_override]


def _evaluate(self, wrapped_engine, evaluation_dataset, metrics_override: list[str] | None = None) -> EvaluationResult:
    metrics = self._get_metrics(metrics_override)
    result = evaluate(
        query_engine=wrapped_engine,
        metrics=metrics,
        dataset=evaluation_dataset,
        token_usage_parser=get_token_usage_for_openai,
    )
    return result


def _evaluate_metrics_only(self, evaluation_dataset, metrics_override: list[str] | None = None) -> EvaluationResult:
    """
    Evaluate metrics on a dataset that already has responses populated.
    Does not re-query the engine.
    """
    metrics = self._get_metrics(metrics_override)
    result = ragas_evaluate(
        dataset=evaluation_dataset,
        metrics=metrics,
        token_usage_parser=get_token_usage_for_openai,
    )
    return result
```
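A possible call pattern once the helper is in place (variable names here are placeholders, not code from the PR):

```python
# Default path: run all five metrics, querying the engine for each row.
result_all = self._evaluate(wrapped_engine, evaluation_dataset)

# Subset path: only factual correctness, reusing responses already stored in the dataset.
result_fc = self._evaluate_metrics_only(
    evaluation_dataset, metrics_override=["factual_correctness"]
)
```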
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`evaluation/evaluation.py` (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
evaluation/evaluation.py (1)
worker/tasks.py (1)
`query_data_sources` (61-140)
🪛 Ruff (0.13.3)
evaluation/evaluation.py
82-82: f-string without any placeholders
Remove extraneous f prefix
(F541)
171-171: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
176-176: f-string without any placeholders
Remove extraneous f prefix
(F541)
179-179: f-string without any placeholders
Remove extraneous f prefix
(F541)
215-216: try-except-pass detected, consider logging the exception
(S110)
215-215: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: evaluate
- GitHub Check: ci / test / Test
🔇 Additional comments (3)
evaluation/evaluation.py (3)
170-174: Verify merge logic handles duplicate user_input+reference+response combinations. The merge uses `["user_input", "reference", "response"]` as join keys. If the filtered dataset contains duplicate combinations of these three fields, the merge could produce unexpected duplicate rows or incorrect metric associations. Consider adding a check before the merge to ensure uniqueness:
```diff
+   # Verify uniqueness of join keys in both dataframes
+   if df_fc[["user_input", "reference", "response"]].duplicated().any():
+       logging.warning("Duplicate join keys found in factual_correctness results")
+   if df_core[["user_input", "reference", "response"]].duplicated().any():
+       logging.warning("Duplicate join keys found in core metrics results")
+
    merged = df_fc.merge(
        df_core[["user_input", "reference", "response"] + core_cols],
        on=["user_input", "reference", "response"],
        how="left",
    )
```
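Alternatively (or additionally), pandas can enforce the expectation itself: `DataFrame.merge` accepts a `validate` argument that raises `pandas.errors.MergeError` when the join keys are not unique on both sides:

```python
merged = df_fc.merge(
    df_core[["user_input", "reference", "response"] + core_cols],
    on=["user_input", "reference", "response"],
    how="left",
    validate="one_to_one",  # fail fast instead of silently duplicating rows
)
```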
247-267: LGTM! The addition of the `metrics_override` parameter provides flexibility in selecting which metrics to evaluate. The implementation correctly maps metric names to their instances and handles both full evaluation and subset evaluation cases.
82-106: Verify `Executor.results()` ordering. The code assumes `exec.results()` returns results in the same order as submissions, mapping directly to `evaluation_dataset_all.samples`. Confirm that `Executor.results()` preserves submission order or adapt the logic to align by task name (e.g., `query-{i}`).
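If ordering turns out not to be guaranteed, one defensive option is to key results by the submitted task name rather than by position. This is only a sketch: the shape of `exec.results()` (name/response pairs) and the `query-{i}` naming are assumptions taken from the comment above, not verified against the Executor implementation.

```python
# Assumes each job was submitted with name=f"query-{i}" and that results expose (name, response) pairs.
results_by_name = dict(exec.results())

for i, sample in enumerate(evaluation_dataset_all.samples):
    sample.response = results_by_name[f"query-{i}"]
```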
- Updated the evaluation metrics to include 'factual_correctness(mode=f1)', enhancing the assessment capabilities of the evaluation workflow.
- This addition allows for a more comprehensive analysis of the evaluation results.
Summary by CodeRabbit
New Features
Bug Fixes / Improvements